ABSTRACT

Perfect hash functions can potentially be used to compress 
data in connection with a variety of data management tasks. 
Though there has been considerable work on how to con- 
struct good perfect hash functions, there is a gap between 
theory and practice among all previous methods on min- 
imal perfect hashing. On one side, there are good the- 
oretical results without experimentally proven practicality 
for large key sets. On the other side, there are algorithms
whose time and space usage has been analyzed theoretically
but that assume truly random hash functions are available for
free, which is an unrealistic assumption. In this paper we
attempt to bridge this gap between theory and practice, us- 
ing a number of techniques from the literature to obtain a 
novel scheme that is theoretically well-understood and at the 
same time achieves an order-of-magnitude increase in per- 
formance compared to previous "practical" methods. This 
improvement comes from a combination of a novel, theo- 
retically optimal perfect hashing scheme that greatly sim- 
plifies previous methods, and the fact that our algorithm is 
designed to make good use of the memory hierarchy. We 
demonstrate the scalability of our algorithm by considering 
a set of over one billion URLs from the World Wide Web of 
average length 64, for which we construct a minimal perfect 
hash function on a commodity PC in a little more than 1 
hour. Our scheme produces minimal perfect hash functions 
using slightly more than 3 bits per key. For perfect hash 
functions in the range {0, ..., 2n - 1} the space usage drops
to just over 2 bits per key (i.e., one bit more than optimal
for representing the key). This is significantly below what
has been achieved previously for very large values of n.

1. INTRODUCTION 

Some types of databases are updated only rarely, typically 
by periodic batch updates. This is true, for example, for 
most data warehousing applications (see [32] for more ex- 
amples and discussion). In such scenarios it is possible to 
improve query performance by creating very compact
representations of keys by minimal perfect hash functions. In
applications where the set of keys is fixed for a long period
of time the construction of a minimal perfect hash function 
can be done as part of the preprocessing phase. For exam- 
ple, On-Line Analytical Processing (OLAP) applications use 
extensive preprocessing of data to allow very fast evaluation 
of certain types of queries. 

Perfect hashing is a space-efficient way of creating a compact
representation for a static set S of n keys. For applications
with successful searches, the representation of a key x ∈ S
is simply the value h(x), where h is a perfect hash function
for the set S of values considered. The word "perfect" refers
to the fact that the function maps the elements of S to
unique values (it is identity preserving). A minimal perfect
hash function (MPHF) produces values that are integers in the
range [0, n - 1], which is the smallest possible range. It is
known that O(n) bits suffice to store a minimal perfect hash
function, and there are theoretical results that use around
1.4427n bits, asymptotically for large n [17].

We now present some examples where minimal perfect hash
functions have been applied successfully:

• A perfect hash function can be used to implement a 
data structure with the same functionality as a Bloom 
filter [27]. In many applications where a set S of elements
is to be stored, it is acceptable to include in the set some
false positives^1 with a small probability by storing a
signature for each perfect hash value. This data structure
requires around 30% less space than a Bloom filter, plus the
space for the perfect hash function. Bloom filters have
applications in distributed databases and data mining
(association rule mining [7][8]).

• Perfect hash functions have also been used to speed up
the partitioned hash-join algorithm presented in [25].
By using a perfect hash function to reduce the targeted
hash-bucket size from 4 tuples to just 1 tuple, the authors
avoided following the bucket chain during hash lookups,
which causes too many cache and translation lookaside
buffer (TLB) misses.

• Suppose there is a composite foreign key to a table
T of size n. Then the size of the key needed in T
will typically be larger than log n. For example, suppose
tuples of R contain geographical coordinates that are used
as foreign key references. Replacing the coordinates with a
surrogate key may be a bad choice if common queries on R
involve conditions on the coordinates, as a join would be
required to retrieve the coordinates. In general, if we have
a natural foreign key that carries information relevant for
queries, we can avoid the cost of additionally storing a
surrogate key.

^1 False positives are elements that appear to be in S but are
not.

The perfect hash function used depends on the set S of
distinct attribute values that occur. It is known that
maintaining a perfect hash function dynamically under
insertions into S is only possible using space that is
super-linear in n [11]. However, in this paper we consider the
case where S is fixed, and construction of a perfect hash
function can be done as part of the preprocessing of data
(e.g., in a data warehouse). Previous methods have two
shortcomings. First, to the best of our knowledge, previously
suggested perfect hashing methods have not been even close to
a space usage of 1.4427n bits for realistic data sizes.
Second, all previous methods either suffer from an incomplete
theoretical understanding (so there is no guarantee that they
work well on a given data set) or seem impractical due to a
very intricate and time-consuming evaluation procedure.

In this paper we present a scalable algorithm that produces 
minimal perfect hash functions using slightly more than 3 
bits per key. Also, if we are happy with values in the range
{0, ..., 2n - 1} (i.e., using one bit more than optimal for
representing the surrogate key) the space usage drops to
just over 2 bits per key. This is significantly below what
has been achieved previously for realistic values of n. We
demonstrate the scalability of our algorithm by considering
a set of over one billion strings (URLs from the World Wide
Web of average length 64), for which we construct a minimal
perfect hash function on a commodity PC in a little more
than 1 hour. This is an order of magnitude faster than
previous methods. If we use the range {0, ..., 2n - 1}, the
space for the perfect hash function is less than 324 MB,
and we still get hash values that can be represented in a
32-bit word. Thus we believe our MPHF method might
be quite useful for a number of current and practical data
management problems.

2. NOTATION 

Suppose U is a universe of keys of size u. Let h : U → M be
a hash function that maps the keys from U to a given interval
of integers M = [0, m - 1] = {0, 1, ..., m - 1}. Let S ⊆ U
be a set of n keys from U, where n ≪ u. Given a key x ∈ S,
the hash function h computes an integer in [0, m - 1]. A
perfect hash function (PHF) maps a set S of n keys from U
into a set of m integer numbers without collisions, where m
is greater than or equal to n. If m is equal to n, the function
is called minimal (MPHF).
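
The definitions above can be checked mechanically on a small
key set. The following Python sketch is illustrative only: the
helper names is_phf and is_mphf are ours, and a dictionary
lookup stands in for a real hash function.

```python
def is_phf(h, keys, m):
    """True if h maps the distinct keys into [0, m - 1] without collisions."""
    values = [h(x) for x in keys]
    in_range = all(0 <= v < m for v in values)
    return in_range and len(set(values)) == len(values)


def is_mphf(h, keys):
    """True if h is a PHF whose range size m equals n = |S|."""
    return is_phf(h, keys, len(keys))


# Toy key set S with n = 3 and a hand-made table playing the role of h.
S = ["apple", "banana", "cherry"]
table = {"apple": 2, "banana": 0, "cherry": 1}
h = table.__getitem__
```

Here is_mphf(h, S) holds because the three keys are mapped to
the three distinct values 0, 1 and 2.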

3. RELATED WORK 

There is a gap between theory and practice among minimal 
perfect hashing methods. On one side, there are good theo- 
retical results without experimentally proven practicality for 
large key sets. We will argue below that these methods are 
indeed not practical. On the other side, there are two
categories of practical algorithms: algorithms whose time
and space usage has been analyzed theoretically but that
assume truly random hash functions, which is an unrealistic
assumption, and algorithms that present only empirical
evidence. The aim of this section is to discuss the gap
among these three types of algorithms available in the
literature.

3.1 Theoretical results 

In this section we review some of the most important theo- 
retical results on minimal perfect hashing. For a complete 
survey until 1997 refer to Czech, Havas and Majewski [10].

Fredman, Komlos and Szemeredi [15] proved, using a counting
argument, that at least n log e + log log u - O(log n) bits
are required to represent a MPHF, provided that u ≥ n^(2+ε)
for some ε > 0 (an easier proof was given by Radhakrishnan
[30]). Mehlhorn has made this bound almost tight
by providing an algorithm that constructs a MPHF that can
be represented with at most n log e + log log u + O(log n)
bits. However, his algorithm is far away from practice because
its construction and evaluation time is exponential in n
(i.e., n^Θ(n e^n u log u)).

Schmidt and Siegel [31] have proposed the first algorithm
for constructing a MPHF with constant evaluation
time and description size O(n + log log u) bits. Nevertheless,
the scheme is hard to implement and the constants associated
with the MPHF storage are prohibitive. For a set of n
keys, at least 29n bits are used, which means that the space
usage is similar in practice to schemes using n log n bits.

More recently, Hagerup and Tholey [17] have come up with
the best theoretical result we know of. The MPHF obtained
can be evaluated in O(1) time and stored in n log e +
log log u + O(n (log log n)^2 / log n + log log log u) bits. The
construction time is O(n + log log u) using O(n) computer
words of the Fredman, Komlos and Szemeredi [15] model
of computation. In this model, also called the Word RAM
model, an element of the universe U fits into one machine
word, and arithmetic operations and memory accesses have
unit cost. In spite of its theoretical importance, the Hagerup
and Tholey [17] algorithm is not practical either, as it
emphasizes asymptotic space complexity only. (It is also
very complicated to implement, but we will not go into
that.) For n < 2^150 the scheme is not even defined, as
it relies on splitting the key set into buckets of size at most
log n / (21 log log n). Even if we fix this by letting the bucket
size be at least 1, then for n < 2^300 buckets of size one will
be used, which means that the space usage will be at least
(3 log log n + log 7)n bits. For a set of a billion keys, this
is more than 17 bits per element. Since 2^300 exceeds the
number of atoms in the known universe, it is safe to conclude
that the Hagerup-Tholey algorithm will not be space
efficient in practical situations.
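
The figures quoted in this argument can be reproduced with a
few lines of arithmetic. The sketch below assumes base-2
logarithms throughout; the function names are ours and serve
only to illustrate the calculation.

```python
import math


def bucket_size_bound(log_n):
    """The bound log n / (21 log log n) on the bucket size, given log2(n)."""
    return log_n / (21 * math.log2(log_n))


def size_one_bits_per_key(n):
    """The lower bound 3 log log n + log 7 on bits per key for size-one buckets."""
    return 3 * math.log2(math.log2(n)) + math.log2(7)


bound_150 = bucket_size_bound(150)           # n = 2^150
bound_300 = bucket_size_bound(300)           # n = 2^300
bits_billion = size_one_bits_per_key(10**9)  # a billion keys
```

Under these assumptions the bucket size bound is still below 1
at n = 2^150 and below 2 at n = 2^300, and the space bound for
a billion keys evaluates to roughly 17.5 bits per key.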

3.2 Practical results assuming full randomness

Let us now describe the main practical results analyzed with 
the unrealistic assumption that truly random hash functions 
are available for free. 

Fox et al. [14] created the first scheme with good average-case
performance for large datasets, i.e., n ≈ 10^6. They
designed two algorithms. The first one generates a MPHF
that can be evaluated in O(1) time and stored in O(n log n)
bits. The second algorithm uses quadratic hashing and
adds branching based on a table of binary values to get
a MPHF that can be evaluated in O(1) time and stored
in c(n + 1/log n) bits. They argued that c would typically
be lower than 5. However, it is clear from their
experimentation that c grows with n, and they did not discuss
this. They claimed that their algorithms would run in linear
time, but it is shown in [10, Section 6.7] that the
algorithms have exponential running times in the worst case,
although the worst case has small probability of occurring.
Fox, Chen and Heath [13] improved the above result to
get a function that can be stored in cn bits. The method
uses four truly random hash functions h_10 : S → [0, n - 1],
h_11 : [0, p_1 - 1] → [0, p_2 - 1], h_12 : [p_1, n - 1] → [p_2, b - 1]
and h_20 : S × {0, 1} → [0, n - 1] to construct a MPHF that
has the following form:

$$h(x) = (h_{20}(x, d) + g(i(x))) \bmod n$$

$$i(x) = \begin{cases} h_{11} \circ h_{10}(x) & \text{if } h_{10}(x) < p_1 \\ h_{12} \circ h_{10}(x) & \text{otherwise,} \end{cases}$$

where p_1 = 0.6n and p_2 = 0.3n were experimentally
determined, and b = ⌈cn/(log n + 1)⌉. Again c is only established
for small values of n. It could very well be that c grows
with n. So, the limitation of the three algorithms is that no
guarantee on the size of the resulting MPHF is provided.

The family of algorithms proposed by Czech et al. [24] uses
the same assumption to construct order preserving MPHFs.
A perfect hash function h is order preserving if the keys
in S are arranged in some given order and h preserves this
order in the hash table. The method uses two truly random
hash functions h_1 : S → [0, cn - 1] and h_2 : S → [0, cn - 1]
to generate MPHFs of the following form: h(x) = (g[h_1(x)] +
g[h_2(x)]) mod n, where c > 2. The resulting MPHFs can be
evaluated in O(1) time and stored in O(n log n) bits (which
is optimal for an order preserving MPHF). The resulting
MPHF is generated in expected O(n) time. Botelho,
Kohayakawa and Ziviani improved the space requirement at
the expense of generating functions in the same form that
are not order preserving. Their algorithm is also linear in
n, but runs faster than the ones by Czech et al. [24] and the
resulting MPHFs are stored using half of the space because
c ∈ [0.93, 1.15]. However, the resulting MPHFs still need
O(n log n) bits to be stored.
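
To make the form h(x) = (g[h_1(x)] + g[h_2(x)]) mod n concrete,
the following Python sketch builds an order preserving MPHF on
an acyclic random graph in the spirit of the approach described
above. It is a minimal illustration under our own assumptions,
not the implementation of [24]: the names seeded_hash, assign_g
and build_mphf are hypothetical, and salted BLAKE2b from
Python's hashlib stands in for the truly random hash functions.

```python
import hashlib
import random


def seeded_hash(key, seed, m):
    """Stand-in for a truly random hash function with values in [0, m - 1]."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little") % m


def assign_g(edges, m, n):
    """Return g with (g[u] + g[v]) mod n == i for every edge i, or None on a cycle."""
    adjacency = [[] for _ in range(m)]
    for i, (u, v) in enumerate(edges):
        adjacency[u].append((v, i))
        adjacency[v].append((u, i))
    g = [0] * m
    visited = [False] * m
    used = [False] * len(edges)
    for root in range(m):
        if visited[root]:
            continue
        visited[root] = True
        stack = [root]
        while stack:
            u = stack.pop()
            for v, i in adjacency[u]:
                if used[i]:
                    continue
                used[i] = True
                if visited[v]:
                    return None  # this edge closes a cycle (or is a self-loop)
                visited[v] = True
                g[v] = (i - g[u]) % n  # forces g[u] + g[v] = i (mod n)
                stack.append(v)
    return g


def build_mphf(keys, c=2.1, seed=0, max_tries=1000):
    """Retry random seeds until the graph with one edge per key is acyclic."""
    n = len(keys)
    m = int(c * n) + 1
    rng = random.Random(seed)
    for _ in range(max_tries):
        s1, s2 = rng.getrandbits(32), rng.getrandbits(32)
        edges = [(seeded_hash(k, s1, m), seeded_hash(k, s2, m)) for k in keys]
        g = assign_g(edges, m, n)
        if g is not None:
            return lambda x: (g[seeded_hash(x, s1, m)] + g[seeded_hash(x, s2, m)]) % n
    raise RuntimeError("no acyclic graph found")


keys = ["key%d" % i for i in range(200)]
h = build_mphf(keys)
```

Because the edge of the i-th key is labelled i, the resulting
function maps keys[i] to i. The array g holds about cn values
of log n bits each, which is where the O(n log n) bits come
from.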

Since the space requirements for truly random hash functions
make them unsuitable for implementation, one has
to settle for a more realistic setup. The first step in this
direction was given by Pagh [28]. He proposed a family of
randomized algorithms for constructing MPHFs of the form
h(x) = (f(x) + d[g(x)]) mod n, where f and g are chosen
from a family of universal hash functions and d is a set of
displacement values to resolve collisions that are caused by
the function f. Pagh identified a set of conditions concerning
f and g and showed that if these conditions are satisfied,
then a minimal perfect hash function can be computed in
expected time O(n) and stored in (2 + ε)n log n bits.

Dietzfelbinger and Hagerup [12] improved the algorithm
proposed in [28], reducing from (2 + ε)n log n to (1 + ε)n log n
the number of bits required to store the function, but in
their approach f and g must be chosen from a class of hash
functions that meet additional requirements.



3.3 Empirical results 

In this section we discuss results that present only empirical
evidence for specific applications. Lefebvre and Hoppe [23]
have recently designed MPHFs that require up to 7 bits per
key to be stored and are tailored to represent sparse spatial
data. In the same trend, Chang, Lin and Chou [7][8] have
designed MPHFs tailored for mining association rules and
traversal patterns in data mining techniques.

3.4 Differences between our results and previous results

In this work we propose an algorithm that is theoretically 
well-understood and achieves an order-of-magnitude increase 
in the performance on a commodity PC compared to previ- 
ous "practical" methods. To the best of our knowledge our 
algorithm is the first one that demonstrates the capability 
of generating MPHFs for sets in the order of billions of keys, 
and the generated functions require less than 4 bits per key 
to be stored. This is one order of magnitude larger than the
largest key set for which a MPHF had been obtained
in the literature [5]. This improvement comes mainly from
the fact that our method is designed to make good use of the
memory hierarchy. We need O(N) computer words, where
N ≪ n, for the construction process. Notice that both space
usage for representing the MPHF and the construction time
are carefully proven. Additionally, our scheme is much
simpler than previous theoretically well-founded schemes.

4. THE ALGORITHM 

Our algorithm uses the well-known idea of partitioning the
key set into a number of small sets^2 (called "buckets") using
a hash function h_0. Let B_i = {x ∈ S | h_0(x) = i} denote
the ith bucket. If we define offset[i] = Σ_{j=0}^{i-1} |B_j| and let
p_i denote a MPHF for B_i, then clearly

$$p(x) = p_i(x) + \mathrm{offset}[h_0(x)] \qquad (1)$$

is a MPHF for the whole set S. Thus, the problem is reduced
to computing and storing the offset array, as well as the
MPHF for each bucket.
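
The composition in equation (1) can be sketched in a few lines
of Python. This is only an illustration: build_bucketed_mphf is
a hypothetical name, the toy h0 is not the hash function used
in this paper, and a per-bucket dictionary stands in for the
MPHF of each bucket.

```python
from itertools import accumulate


def build_bucketed_mphf(keys, h0, num_buckets):
    """Combine per-bucket MPHFs with an offset array, as in equation (1)."""
    buckets = [[] for _ in range(num_buckets)]
    for x in keys:
        buckets[h0(x)].append(x)
    # offset[i] = |B_0| + ... + |B_{i-1}|
    offset = [0] + list(accumulate(len(b) for b in buckets))
    # Placeholder for the MPHF of bucket i: the position of x inside the bucket.
    local = [{x: pos for pos, x in enumerate(b)} for b in buckets]
    return lambda x: local[h0(x)][x] + offset[h0(x)]


keys = ["ant", "bee", "cat", "dog", "eel", "fox", "gnu"]
h0 = lambda x: ord(x[0]) % 4   # toy bucket function with 4 buckets
p = build_bucketed_mphf(keys, h0, 4)
```

With this toy h0, bucket 0 holds only "dog", so p("dog") = 0,
and the keys of each later bucket occupy the next consecutive
block of values.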

Figure 1 illustrates the two steps of the algorithm: the
partitioning step and the searching step. The partitioning step
takes a key set S and uses a hash function h_0 to partition S
into 2^b buckets. The searching step generates a MPHF p_i
for each bucket i, 0 ≤ i ≤ 2^b - 1, and computes the offset
array. From now on the algorithm used to compute the MPHF
of each bucket is referred to as the internal algorithm and the
whole scheme is referred to as the external algorithm.

We will choose h_0 such that it has values in {0, 1}^b, for some
integer b. Since the offset array holds 2^b entries of at least
log n bits we want 2^b to be less than around n / log n, making
the space used for the offset array negligible. On the other
hand, to allow efficient implementation of the functions p_i
we impose an upper bound ℓ on the size of any bucket. We
will describe later how to choose h_0 such that this upper
bound holds.



^2 Used in e.g. the perfect hash function constructions of
Schmidt and Siegel [31] and Hagerup and Tholey [17], for
suitable definition of "small".
Figure 1: Main steps of our algorithm



To create the MPHFs p_i we could choose from a number
of alternatives, emphasizing either space usage, construction
time, or evaluation time. We show that all methods
based on the assumption of truly random hash functions
can be made to work, with explicit and provably good hash
functions. For the experiments we have implemented the
algorithm described in Section 4.2. Since this computation
is done on a small set, we can expect nearly all memory
accesses to be "cache hits". We believe that this is the main
reason why our method performs better than previous ones
that access memory in a more "random" fashion.

We consider the situation in which the set of all keys may 
not fit in the internal memory and has to be written on disk. 
Our external algorithm first scans the list of keys and com- 
putes the hash function values that will be needed later on 
in the algorithm. These values will (with high probability) 
distinguish all keys, so we can discard the original keys. It 
is well known that hash values of at least 2 log n bits are 
required to make this work. Thus, for sets of a billion keys 
or more we cannot expect the list of hash values to fit in the 
internal memory of a standard PC. 
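
The role of the hash value length can be illustrated with the
standard union (birthday) bound. The sketch below assumes
uniformly random hash values; collision_bound is a name we
introduce for illustration.

```python
def collision_bound(n, bits):
    """Union bound on the probability that two of n keys share a hash value."""
    return min(1.0, n * (n - 1) / 2 / 2**bits)


n = 10**9
p_30 = collision_bound(n, 30)    # about log n bits: the bound is vacuous
p_60 = collision_bound(n, 60)    # about 2 log n bits
p_128 = collision_bound(n, 128)  # 128-bit hash values
```

For a billion keys the bound only drops below 1 at around
2 log n ≈ 60 bits, and every additional bit halves it. With
128-bit values, as used in Section 4.1, it is below 10^-20.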

To form the buckets we sort the hash values of the keys
according to the value of h_0. Since we are interested in
scalability to large key sets, this is done using an
implementation of an external memory mergesort [22]. If the
mergesort works in two phases, which is the case for all
reasonable parameters, the total work on the disk consists of
reading the keys, plus writing and reading the hash function
values once. Since the h_0 hash values are relatively small
(less than 15 decimal digits) we can use radix sort to do the
internal memory sorting.

The detailed description of the external algorithm is
presented in Section 4.1. The internal algorithm used to
compute the MPHF of each bucket is presented in Section 4.2.
The internal algorithm uses two hash
functions h_i1 and h_i2 to compute a MPHF p_i. These hash
functions, as well as the hash function h_0 used in the
partitioning step of the external algorithm, are described in
Section 4.3.



4.1 The external algorithm 

In this section we present the implementation
of the two-step external memory based algorithm and the
values of the parameters related to the algorithm. The
algorithm is essentially a two-phase multi-way merge sort with
some nuances to make it work in linear time.

The partitioning step performs two important tasks. First,
the variable-length keys are mapped to 128-bit strings by
using the linear hash function h' presented in Section 4.3.
That is, the variable-length key set S is mapped to a
fixed-length key set F. Second, the set S of n keys is partitioned
into 2^b buckets, where b is a suitable parameter chosen to
guarantee that each bucket has at most ℓ = 256 keys with
high probability (see Section 4.3). We have two reasons for
choosing ℓ = 256. The first one is to keep the bucket size
small enough to be represented by 8-bit integers. The second
one is to allow the memory accesses during the MPHF
evaluation to be done in the cache most of the time. Figure 2
presents the partitioning step algorithm.



► Let β be the size in bytes of the fixed-length key
set F

► Let μ be the size in bytes of an a priori reserved
internal memory area

► Let N = ⌈β/μ⌉ be the number of key blocks that
will be read from disk into an internal memory area

1. for j = 1 to N do

1.1 Read a key block S_j from disk (one at a time)
and store h'(x), for each x ∈ S_j, into B_j

1.2 Cluster B_j into 2^b buckets using an indirect radix
sort algorithm that takes h_0(x) for x ∈ S_j as
sorting key (i.e., the b most significant bits of h'(x))

1.3 Dump B_j to the disk into File j

Figure 2: Partitioning step
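
The partitioning step can be sketched in Python as follows. The
sketch rests on several assumptions of ours: BLAKE2b from
hashlib stands in for the linear hash function h', the bucket
index is taken as the b most significant bits of the whole
128-bit value, in-memory lists play the role of the files on
disk, and the block size is given in keys rather than bytes.
The function names are illustrative.

```python
import hashlib


def fingerprint(key):
    """Hypothetical stand-in for h': maps a byte string to a 128-bit integer."""
    return int.from_bytes(hashlib.blake2b(key, digest_size=16).digest(), "big")


def bucket_index(fp, b):
    """The b most significant bits of the 128-bit value."""
    return fp >> (128 - b)


def sort_block_by_bucket(block, b):
    """Counting sort on the bucket index (one radix-sort pass), linear in |block|."""
    start = [0] * (2**b + 1)
    for fp in block:
        start[bucket_index(fp, b) + 1] += 1
    for i in range(2**b):
        start[i + 1] += start[i]       # start[i] = first output slot of bucket i
    out = [0] * len(block)
    for fp in block:
        i = bucket_index(fp, b)
        out[start[i]] = fp
        start[i] += 1
    return out


def partition(keys, block_size, b):
    """Return the sorted blocks ("File 1", ..., "File N") as in-memory lists."""
    runs = []
    for lo in range(0, len(keys), block_size):
        block = [fingerprint(k) for k in keys[lo:lo + block_size]]
        runs.append(sort_block_by_bucket(block, b))
    return runs


keys = [("http://example.org/page%d" % i).encode() for i in range(1000)]
runs = partition(keys, block_size=300, b=4)
```

Each block is processed independently, so the memory needed is
proportional to the block size and not to the number of keys.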

The critical point in Figure 2 that allows the partitioning
step to work in linear time is the internal sorting algorithm.
We have two reasons to choose radix sort. First, it sorts
each key block B_j in linear time, since keys are short integer
numbers (less than 15 decimal digits). Second, it just needs
O(|B_j|) words of extra memory so that we can control the
memory usage independently of the number of keys in S.

At this point one could ask: why not use the well-known
replacement selection algorithm to build files larger than the
internal memory area size? The reason is that the radix sort
algorithm sorts a block B_j in time O(|B_j|) while the
replacement selection algorithm requires O(|B_j| log |B_j|). We
have tried out both versions and the one using the radix sort
algorithm outperforms the other. A worthwhile optimization we
have used is the last run optimization proposed by Larson
and Graefe [22]. That is, the last block is kept in memory
instead of dumping it to disk to be read again in the second
step of the algorithm.

Figure 3(a) shows a logical view of the 2^b buckets generated
in the partitioning step. In reality, the 128-bit strings
belonging to each bucket are distributed among many files,
as depicted in Figure 3(b). In the example of Figure 3(b),
the 128-bit strings in bucket 0 appear in files 1 and N, the
128-bit strings in bucket 1 appear in files 1, 2 and N, and
so on.
Figure 3: Situation of the buckets at the end of the
partitioning step: (a) Logical view (b) Physical view



This scattering of the 128-bit strings in the buckets could
generate a performance problem because of the potential
number of seeks needed to read the 128-bit strings in each
bucket from the N files on disk during the second step. But,
as we show later on in Section 5.3, the number of seeks can
be kept small using buffering techniques.

The searching step is responsible for generating a MPHF
for each bucket and for computing the offset array. Figure 4
presents the searching step algorithm. Statement 1 of
Figure 4 constructs the heap H of size N. This is well known
to be linear in N. The order relation in H is given by the
bucket address i (i.e., the b most significant bits of x ∈ F).
Statement 2 has two important steps. In statement 2.1, a
bucket is read from disk, as described below. In statement
2.2, a MPHF is generated for each bucket B_i using the
internal memory based algorithm presented in Section 4.2. In
statement 2.3, the next entry of the offset array is computed.
Finally, statement 2.4 writes the description of MPHF_i and
offset[i] to disk. Note that to compute offset[i + 1] we just
need the current bucket size and offset[i]. So, we just need
to maintain two entries of vector offset in memory all the
time.

► Let H be a minimum heap of size N, where the
order relation in H is given by
i = x[96, 127] >> (32 - b) for x ∈ F

1. for j = 1 to N do { Heap construction }

1.1 Read the first 128-bit string x from File j on disk

1.2 Insert (i, j, x) in H

2. for i = 0 to 2^b - 1 do

2.1 Read bucket B_i from disk driven by heap H

2.2 Generate a MPHF for bucket B_i

2.3 offset[i + 1] = offset[i] + |B_i|

2.4 Write the description of MPHF_i and offset[i]
to the disk

Figure 4: Searching step

The algorithm to read bucket B_i from disk is presented in
Figure 5. Bucket B_i is distributed among many files and
the heap H is used to drive a multiway merge operation.
Statement 1.1 extracts and removes triple (i, j, x) from H,
where i is a minimum value in H. Statement 1.2 inserts x in
bucket B_i. Statement 1.3 performs a seek operation in File
j on disk for the first read operation and reads sequentially
all 128-bit strings x ∈ F that have the same index i and
inserts them all in bucket B_i. Finally, statement 1.4 inserts
in H the triple (i', j, x'), where x' ∈ F is the first 128-bit
string read from File j (in statement 1.3) that does not have
the same bucket address as the previous keys.

1. while bucket B_i is not full do

1.1 Remove (i, j, x) from H

1.2 Insert x into bucket B_i

1.3 Read sequentially all 128-bit strings from File j
that have the same i and insert them into B_i

1.4 Insert the triple (i', j, x') in H, where x' is
the first 128-bit string read from File j that
does not have the same bucket index i

Figure 5: Reading a bucket
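
The heap-driven merge of the searching step can be sketched
with Python's heapq module. This is a simplified illustration
under our own assumptions: the files are in-memory lists
already sorted by bucket index, the heap holds pairs (i, j)
instead of triples (i, j, x) because the next string of File j
is found through a read position, and all buckets are collected
in a list for inspection, whereas a real implementation would
build the MPHF of a bucket and discard it before moving on. The
width parameter is the length in bits of the hash values.

```python
import heapq


def searching_step(runs, b, width=128):
    """Merge runs sorted by bucket index; return (buckets, offset)."""
    index = lambda fp: fp >> (width - b)
    pos = [0] * len(runs)                          # read position in each file
    heap = [(index(run[0]), j) for j, run in enumerate(runs) if run]
    heapq.heapify(heap)                            # heap construction, linear in N
    buckets = [[] for _ in range(2**b)]
    while heap:
        i, j = heapq.heappop(heap)                 # file whose next string has minimum i
        run = runs[j]
        while pos[j] < len(run) and index(run[pos[j]]) == i:
            buckets[i].append(run[pos[j]])         # sequential read within File j
            pos[j] += 1
        if pos[j] < len(run):                      # first string of a later bucket
            heapq.heappush(heap, (index(run[pos[j]]), j))
    offset = [0] * (2**b + 1)
    for i in range(2**b):
        offset[i + 1] = offset[i] + len(buckets[i])
    return buckets, offset


# Toy example with 8-bit values and b = 2 (bucket index = top two bits).
runs = [[1, 64, 192], [2, 193]]
buckets, offset = searching_step(runs, b=2, width=8)
```

In the toy example bucket 0 receives one string from each file,
and bucket 2 stays empty, so offset[2] and offset[3] coincide.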

It is not difficult to see from this presentation of the
searching step that it runs in linear time. To achieve this
conclusion we use O(N) computer words to allow the merge
operation to be performed in one pass through each file. In
addition, it is also important to observe that:

1. < 1^ (see Section 1131), 

2. N ≪ n (e.g., see Table 6) and

3. the internal algorithm runs in linear time, as shown in 
Section 

In conclusion, our algorithm takes O(n) time because both
the partitioning and the searching steps run in O(n) time.
The space required for constructing the resulting MPHF is
O(N) computer words because the memory usage in the
partitioning step does not depend on the number of keys
in S and, in the searching step, the internal algorithm is
applied to problems of size up to 256. Altogether, this makes
our algorithm the first one that demonstrates the capability
of generating MPHFs for sets in the order of billions of keys.

4.2 The internal algorithm 

We now describe the internal algorithm, a simple and
space-optimal way of constructing a minimal perfect hash function
for a set S of n elements. We assume that we can create and
access two truly random hash functions f_0 and f_1, mapping
elements of U to {0, ..., r - 1}, where r > (1 + ε)n for some
constant ε > 0. We consider the bipartite graph G with
vertex set V = {0, ..., 2r - 1} and edge set E = {{f_0(x), r +
f_1(x)} | x ∈ S}. From the theory of random graphs [18]
we know that G is acyclic with probability Ω(1), i.e.,
Pr = e^(1/c) √((c - 2)/c), where c = 2(1 + ε). Thus, in an
expected constant number of attempts we can find f_0 and f_1
such that G is acyclic. In fact, this has been used in previous
MPHF constructions [24], but we will proceed differently to
achieve a space usage of O(n) bits rather than O(n log n)
bits.

The data structure for the MPHF consists of two arrays of
2r bits, T_1 and T_2. The space usage for this is 4(1 + ε)n
bits. We will use T_1 to associate a bit with every vertex
of G. For every connected component of G, which is a tree
by choice of f_0 and f_1, we choose an arbitrary root node
v_0 in {0, ..., r - 1}. Then, for a given non-isolated vertex
v of G we can speak of its distance d_v to the root of its
component. We let T_1[v] = 1 if d_v mod 4 ∈ {1, 2}, and
T_1[v] = 0 otherwise. If v is an isolated vertex, we let T_1[v] =
0. It is easy to implement the initialization of T_1 in linear
time by using a depth-first search algorithm. Now consider
the following perfect hash function:

$$\phi(x) = \begin{cases} f_0(x) & \text{if } T_1[f_0(x)] \oplus T_1[r + f_1(x)] = 0 \\ r + f_1(x) & \text{otherwise.} \end{cases}$$

(We denote exclusive-or by ⊕.) Now we will argue that
the function φ is 1-1 on S. An element x ∈ S corresponds
to the edge e = {f_0(x), r + f_1(x)} in G. We argue that φ(x)
equals the vertex in e that is furthest away from the root
of the tree of which e is part: This is clearly true for edges
containing the root element, and the value of T_1[f_0(x)] ⊕
T_1[r + f_1(x)] changes at each step on a root-to-leaf path, as
it should. Since each non-root vertex of a tree is the far
endpoint of exactly one edge, φ is 1-1 on S.

T_2 is used to map down to the interval {0, ..., n - 1} rather
than {0, ..., 2r - 1}. Specifically, we use T_2 to store the set
φ(S) (indicating elements in the set by 1s). Note that T_2
can also be computed in linear time. For a set Y of integers,
let rank_Y(x) = |{y ∈ Y | y < x}|. Then the following is a
minimal perfect hash function for S:

$$h(x) = \mathrm{rank}_{\phi(S)}(\phi(x))$$
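
The construction of T_1, φ, T_2 and h can be sketched in Python
as follows. This is a minimal illustration under our own
assumptions, not the implementation used in the experiments:
salted BLAKE2b from hashlib stands in for the truly random
functions f_0 and f_1, the function names are hypothetical, and
the rank is computed naively by summing a prefix of T_2.

```python
import hashlib
import random


def seeded_hash(key, seed, r):
    """Stand-in for a truly random function into {0, ..., r - 1}."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little") % r


def distance_bits(edges, r):
    """T1[v] = 1 iff d_v mod 4 is 1 or 2; returns None if the graph has a cycle."""
    adjacency = [[] for _ in range(2 * r)]
    for i, (u, v) in enumerate(edges):
        adjacency[u].append((v, i))
        adjacency[v].append((u, i))
    t1 = [0] * (2 * r)
    dist = [-1] * (2 * r)
    used = [False] * len(edges)
    for root in range(r):                 # roots are chosen in {0, ..., r - 1}
        if dist[root] != -1:
            continue
        dist[root] = 0
        stack = [root]
        while stack:                      # depth-first traversal of one tree
            u = stack.pop()
            for v, i in adjacency[u]:
                if used[i]:
                    continue
                used[i] = True
                if dist[v] != -1:
                    return None           # this edge closes a cycle
                dist[v] = dist[u] + 1
                t1[v] = 1 if dist[v] % 4 in (1, 2) else 0
                stack.append(v)
    return t1


def build_internal_mphf(keys, eps=0.045, seed=0, max_tries=1000):
    n = len(keys)
    r = int((1 + eps) * n) + 1
    rng = random.Random(seed)
    for _ in range(max_tries):
        s0, s1 = rng.getrandbits(32), rng.getrandbits(32)
        f0 = lambda x, s=s0: seeded_hash(x, s, r)
        f1 = lambda x, s=s1: seeded_hash(x, s, r)
        t1 = distance_bits([(f0(x), r + f1(x)) for x in keys], r)
        if t1 is None:
            continue                      # graph not acyclic: try new functions

        def phi(x):
            u, v = f0(x), r + f1(x)
            return u if (t1[u] ^ t1[v]) == 0 else v

        t2 = [0] * (2 * r)
        for x in keys:
            t2[phi(x)] = 1                # T2 marks the set phi(S)
        return lambda x: sum(t2[:phi(x)])  # naive rank: number of 1s before phi(x)
    raise RuntimeError("no acyclic graph found")


keys = ["url%d" % i for i in range(100)]
h = build_internal_mphf(keys)
```

Only the two bit arrays and the two seeds are needed to
evaluate h. Replacing the prefix sum by a constant-time rank
structure is discussed next.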

We need to specify how to compute the rank function, where
the set is represented as a bit vector T_2. Note that computing
rank(i) corresponds to counting the number of 1s in T_2
in positions 0, ..., i - 1. This is a well-studied primitive in
succinct data structures, and it is known that it is possible
to compute a rank in constant time, by using o(r) additional
space, see e.g. [29].

For completeness, we describe a practical variant that uses
εr additional bits of space, where ε can be chosen as any
positive number. The evaluation time is O(1/ε). Conceptually,
the scheme is very simple: Store explicitly the rank of
every kth index, where k = ⌊log(r)/ε⌋. To compute rank(i)
we look up the rank of the largest precomputed index j ≤ i,
and count the number of 1s from position j to i - 1. To
do this in time O(1/ε) we use a lookup table that allows us
to count the number of 1s in Ω(log r) bits in constant time.
Such a lookup table takes r^Ω(1) space. Note that if, as in this
paper, we have many MPHFs, they can all share a lookup
table that may be larger than each individual MPHF, to
reduce the constant in the rank computation.
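
The sampled rank structure can be sketched as follows. The
sketch makes assumptions of its own: the bit vector is held in
a Python integer whose bit p is position p, the lookup table
covers 16-bit words, and the names build_samples, rank and
POPCOUNT16 are illustrative.

```python
# Lookup table with the number of 1s in every 16-bit word (2^16 one-byte entries).
POPCOUNT16 = bytes(bin(w).count("1") for w in range(1 << 16))


def build_samples(vec, size, k):
    """samples[t] = number of 1s in positions 0 .. t*k - 1 of the bit vector."""
    samples, ones = [0], 0
    for start in range(0, size, k):
        chunk = (vec >> start) & ((1 << k) - 1)
        ones += bin(chunk).count("1")
        samples.append(ones)
    return samples


def rank(vec, samples, k, i):
    """Number of 1s in positions 0 .. i - 1, for 0 <= i <= size."""
    j = (i // k) * k                      # largest sampled index j <= i
    count = samples[i // k]
    while j < i:                          # at most k/16 table lookups
        width = min(16, i - j)
        count += POPCOUNT16[(vec >> j) & ((1 << width) - 1)]
        j += width
    return count


# Toy bit vector of 300 bits with a 1 at every position divisible by 3.
size, k = 300, 128
vec = sum(1 << p for p in range(0, size, 3))
samples = build_samples(vec, size, k)
```

A larger k means fewer stored samples but more table lookups
per query, which is the space/time trade-off described above.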

4.2.1 Improving the space 

We now sketch a way of improving the space to just over
3 bits per key, adding a little complication to the scheme.
We can notice that the contents of T_1 and T_2 are not
independent. Specifically, there can be a non-zero bit in T_1[i]
only if T_2[i] = 1. We can create a compressed representation
T_1' that uses only n bits and enables us to compute any
bit of T_1 in constant time. First of all, if T_2[i] = 0 we can
conclude that T_1[i] = 0. We want to initialize T_1' such that
T_1[i] = T_1'[rank_φ(S)(i)] whenever T_2[i] = 1, i.e., i ∈ φ(S).
This is possible since rank_φ(S)(i) is 1-1 on elements in φ(S).
In conclusion, we can replace T_1 by T_1' and reduce the space
usage to 2(1 + ε)n + n bits. It is easy to note that T_1' can
be computed in linear time as well.



4.2.2 Choice of parameters for the internal algorithm

The first parameter we are going to discuss is ε, which
allows us to construct an acyclic random graph
with constant probability. We have set ε to 0.045 in order to
get a probability of approximately 0.33 of generating a
random graph with no cycles. As a consequence the expected
number of iterations to generate an acyclic graph is
approximately 3, which comes from 1/Pr.

The larger the value of ε, the sparser the random graph
used and, consequently, the larger the storage requirements
of the resulting MPHFs and the faster the internal
algorithm, because of the greater probability of getting an
acyclic random graph. We have chosen a small value for ε
because we are interested in more compact functions and
the runtime of the internal algorithm is dominated by the
time spent on I/O.

The parameter k is left to be set by the users so that they
can trade space for evaluation time and vice versa. In the
experiments we set k to 128 in order to spend less space
to store the MPHF of each bucket. This means that we
store in a rank table the number of bits set to 1 before every
128th entry in the bit vector T2. As ℓ is upper bounded by
256, at most four and typically two 8-bit integers are
required to store the rank values for each bucket.

To compute rank(i) we look up the rank of the largest
precomputed index j ≤ i, and count the number of 1s from
position j to i − 1. To do this we use a lookup table that
allows us to count the number of 1s in 16 bits in constant
time. Therefore, to compute the number of bits set to 1 in
128 bits we need 8 lookups. Such a lookup table takes 2^16
bytes of space that are shared by all the buckets. We could
trade evaluation time for space by using a lookup table of
2^8 bytes instead. However, 2^16 bytes is small enough to fit
in the cache and to have constant evaluation time.
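A minimal Python sketch of this rank computation is given below, using the parameters stated above (k = 128 and a 2^16-entry table counting the 1s in a 16-bit word). The function names, the word packing order and the extra sentinel entry at the end of the rank table are assumptions made for this illustration and are not taken from the C implementation.

```python
import random

K = 128        # sampling interval used in the experiments
CHUNK = 16     # bits counted per table lookup

# POPCOUNT16[v] = number of 1 bits in the 16-bit value v (2^16 one-byte entries).
POPCOUNT16 = bytearray(1 << CHUNK)
for v in range(1, 1 << CHUNK):
    POPCOUNT16[v] = POPCOUNT16[v >> 1] + (v & 1)

def pack(bits):
    """Pack a list of 0/1 values into 16-bit words, lowest position in bit 0."""
    words = []
    for start in range(0, len(bits), CHUNK):
        word = 0
        for offset, bit in enumerate(bits[start:start + CHUNK]):
            word |= bit << offset
        words.append(word)
    return words

def build_rank_table(words):
    """Store the number of 1s before every K-th bit, plus a final total."""
    words_per_sample = K // CHUNK
    table, total = [], 0
    for idx, word in enumerate(words):
        if idx % words_per_sample == 0:
            table.append(total)
        total += POPCOUNT16[word]
    table.append(total)           # sentinel so rank(len(bits)) is always defined
    return table

def rank(words, table, i):
    """Number of 1s in positions 0..i-1, using at most 8 table lookups."""
    sample = i // K
    count = table[sample]
    for w in range(sample * (K // CHUNK), i // CHUNK):
        count += POPCOUNT16[words[w]]            # full 16-bit words
    partial = i % CHUNK
    if partial:                                  # low bits of the last word
        count += POPCOUNT16[words[i // CHUNK] & ((1 << partial) - 1)]
    return count

random.seed(1)
bits = [random.randint(0, 1) for _ in range(535)]
words = pack(bits)
table = build_rank_table(words)
```

Each query touches one precomputed rank and at most seven full words plus one masked word, which matches the count of 8 lookups per 128 bits.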

4.3 Hash functions used by the algorithms 

The aim of this section is threefold. First, in Section 4.3.1
we define the hash function h_0 used to split the key set S
into 2^b buckets, and the hash functions h_{i1} and h_{i2} used by
the internal algorithm to generate the MPHF of each bucket,
where 0 ≤ i ≤ 2^b − 1. Second, in Section 4.3.2 we present
the implementation details of those hash functions. Third,
in Section 4.3.3 we show the conditions that parameter b
must meet so that no bucket with more than ℓ keys is created
by h_0. We also show that h_{i1} and h_{i2} are truly random hash
functions for the buckets.

4.3.1 Definitions 

We have made the design decision to make use of tabulation
based hash functions, which seem to be a more practical
alternative than hash functions based on integer multiplication
of keys.* We will make extensive use of the linear hash
functions analyzed by Alon, Dietzfelbinger, Miltersen and
Petrank [1]. To do so we consider a key as a 0-1 vector of
length L. The variable-length strings that we consider are
conceptually made fixed length by padding with zeros at the
end, which results in a unique vector since ASCII character 0
does not appear in any string.

*For example, as far as we know the best way of implementing
multiplication of two 64-bit integers on contemporary
machines is by "school method" reduction to 4 multiplications
of 32-bit integers. Similarly, a "mod p" operation
on the resulting 128-bit integer, where p is a 32-bit integer,
seems to require 3 multiplications of 32-bit integers and 4
modulo operations on 64-bit integers.

Mathematically, h_0 is a randomly chosen linear map over
GF(2) from {0, 1}^L to {0, 1}^b. To get an efficient implementation,
we use a tabulation idea from [2] where we can get
evaluation time O(L/log σ) by using space Lσ; see Section
4.3.2 for implementation details. Choosing σ = n^{Ω(1)}
we obtain evaluation time O(L/log n). (In theory we could
get evaluation time O(L/w), where w ≥ log n is the word
length of the computer, by first hashing down to O(log n)
bits using universal hashing; however, this does not seem to
give an improvement in practice.) We choose b as small as
possible such that the maximum bucket size is bounded by
ℓ with reasonable probability (some constant close to 1). By
a result of [1] we know that

\[ b \le \log n - \log(\ell / \log \ell) + O(1). \qquad (2) \]

For the implementation, we will experimentally determine 
the smallest possible choices of b. 

To define h_{i1} and h_{i2} we proceed as follows. Again use
the linear hash function of [1] to implement hash functions
y_1, ..., y_k from {0, 1}^L to {0, 1}^{r−1}0, where r > log ℓ and k
are parameters to be determined later. Note that the range
is the set of r-bit strings ending with a 0. The purpose of
the last 0 is to ensure that we can have no collision between
y_j(x_1) and y_j(x_2) ⊕ 1, 1 ≤ j ≤ k, for any pair of elements
x_1 and x_2. Let p be a prime number much larger than the
size of the desired range of h_{i1} and h_{i2}, which in our case
is |B_i|, and let t_1, ..., t_{2k} be tables of 2^r random values in
{0, ..., p − 1}. We then define:

\[
\begin{aligned}
\rho(x, s, \Delta) &= \Bigl( \sum_{j=1}^{k} t_j[y_j(x) \oplus \Delta] + s \sum_{j=k+1}^{2k} t_j[y_{j-k}(x) \oplus \Delta] \Bigr) \bmod p \\
h_{i1}(x) &= \rho(x, s_i, 0) \bmod |B_i| \\
h_{i2}(x) &= \rho(x, s_i, 1) \bmod |B_i|
\end{aligned}
\qquad (3)
\]

where the symbol ⊕ denotes exclusive-or and the variable
s_i is specific to bucket i. To find s_i we choose random values
from {1, ..., p − 1} until the functions h_{i1} and h_{i2} work
with the internal algorithm of Section 4.2. It is known that
a constant fraction of the set of all functions work; in Section
4.3.3 we will argue that this will also be the case when
the hash functions are chosen as above.
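The definition in Eq. (3) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function names and the list-of-lists table layout are hypothetical, the tables are filled with Python's `random` module, and the values y_1(x), ..., y_k(x) are passed in as precomputed integers. The prime is the value reported in Section 4.3.2.

```python
import random

P = 4294967291     # prime p reported in Section 4.3.2

def make_tables(k, r, rng):
    """2k tables, each holding 2^r random values in {0, ..., P-1}."""
    return [[rng.randrange(P) for _ in range(1 << r)] for _ in range(2 * k)]

def rho(ys, s, delta, tables):
    """ys = [y_1(x), ..., y_k(x)], each an r-bit value whose last bit is 0."""
    k = len(ys)
    first = sum(tables[j][ys[j] ^ delta] for j in range(k))
    # Tables k+1..2k are indexed by the same y values (t_j[y_{j-k}(x) xor delta]).
    second = sum(tables[k + j][ys[j] ^ delta] for j in range(k))
    return (first + s * second) % P

def h_i1(ys, s_i, tables, bucket_size):
    return rho(ys, s_i, 0, tables) % bucket_size

def h_i2(ys, s_i, tables, bucket_size):
    return rho(ys, s_i, 1, tables) % bucket_size

rng = random.Random(0)
tables = make_tables(k=2, r=4, rng=rng)   # tiny sizes, for illustration only
ys = [0b0110, 0b1010]                     # both values end with a 0 bit
s_i = 7
v1 = h_i1(ys, s_i, tables, bucket_size=200)
v2 = h_i2(ys, s_i, tables, bucket_size=200)
```

Because every y value ends in 0, XOR with Δ = 1 only flips the last bit, so the two functions read disjoint table entries for the same key.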

4.3.2 Implementation details 

In order to implement the functions h_0, y_1, y_2, y_3, ..., y_k so
that they are computed at once, we use a function h' from a family
of linear hash functions over GF(2) proposed by Alon, Dietzfelbinger,
Miltersen and Petrank [1]. The function has
the following form: h'(x) = Ax, where x ∈ S and A is a
γ × L matrix in which the elements are randomly chosen
from {0, 1}. The output is a bit string of an a priori defined
size γ. In our implementation γ = 128 bits. It is important
to realize that this is a matrix multiplication over GF(2).
The implementation can be done using a bitwise-and
operator (&) and a function f : {0, 1}^L → {0, 1} to compute
parity instead of multiplying numbers. The parity function
f(a) produces 1 as a result if a ∈ {0, 1}^L has an odd number
of bits set to 1, otherwise the result is 0. For example, let
us consider L = 3 bits, γ = 3 bits, x = 110 and



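\[ A = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}. \]

The number of rows gives the required number of bits in the
output, i.e., γ = 3. The number of columns corresponds to
the value of L. Then,

\[
h'(x) = Ax =
\begin{pmatrix} 1 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
\begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix},
\]

where b_1 = f(101 & 110) = 1, b_2 = f(001 & 110) = 0 and
b_3 = f(110 & 110) = 0.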
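The worked example can be checked with a short Python sketch. The representation is an assumption made for illustration: each row of A is packed into an integer, and the function names are hypothetical.

```python
def parity(a):
    """f(a): 1 if a has an odd number of bits set to 1, otherwise 0."""
    return bin(a).count("1") & 1

def linear_hash(rows, x):
    """h'(x) = Ax over GF(2); rows[i] holds row i of A packed into an integer."""
    return [parity(row & x) for row in rows]

A = [0b101, 0b001, 0b110]
x = 0b110
result = linear_hash(A, x)   # [b_1, b_2, b_3]
```

Here `result` is `[1, 0, 0]`, in agreement with the values of b_1, b_2 and b_3 above.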



To get a fast evaluation time, some tabulation is required.
Note that if x is short, e.g. 8 bits, we can simply tabulate
all the function values and compute h'(x) by looking up the
value h'[x] in an array h'. To make the same thing work
for longer keys, split the matrix A into parts of 8 columns
each: A = A_1|A_2| ... |A_{⌈L/8⌉}, and create a lookup table h'_i
for each submatrix. Similarly split x into parts of 8 bits,
x = x_1 x_2 ... x_{⌈L/8⌉}. Now h'(x) is the exclusive-or of h'_i[x_i],
for i = 1, ..., ⌈L/8⌉. Therefore, we have set σ to 256 so
that keys of size L can be processed in chunks of log σ = 8
bits. In our URL collection the largest key has 65 bytes, i.e.,
L = 520 bits.
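The tabulation idea can be sketched as follows. This is a small Python illustration with hypothetical names and deliberately small sizes (L = 32, γ = 16 rather than L = 520, γ = 128). Which 8-bit chunk counts as x_1 is a convention chosen here, not taken from the paper. The sketch compares the table-based evaluation against the direct AND-plus-parity evaluation.

```python
import random

def parity(a):
    return bin(a).count("1") & 1

def direct_hash(rows, x):
    """Bit i of the result is the parity of (row i AND x)."""
    out = 0
    for i, row in enumerate(rows):
        out |= parity(row & x) << i
    return out

def build_tables(rows, num_chunks):
    """tables[c][v] = h' of the key whose c-th 8-bit chunk is v and whose
    other chunks are zero, i.e. the contribution of submatrix A_c."""
    return [[direct_hash(rows, v << (8 * c)) for v in range(256)]
            for c in range(num_chunks)]

def tabulated_hash(tables, x):
    """XOR of one table lookup per 8-bit chunk of the key."""
    out = 0
    for c, table in enumerate(tables):
        out ^= table[(x >> (8 * c)) & 0xFF]
    return out

rng = random.Random(42)
L_BITS, GAMMA = 32, 16
rows = [rng.getrandbits(L_BITS) for _ in range(GAMMA)]
tables = build_tables(rows, L_BITS // 8)
keys = [rng.getrandbits(L_BITS) for _ in range(1000)]
matches = all(tabulated_hash(tables, key) == direct_hash(rows, key) for key in keys)
```

The two evaluations agree because the map is linear over GF(2): the hash of a key is the XOR of the hashes of its chunks taken in isolation.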

The 32 most significant bits of h'(x), where x ∈ S, are
used to compute the bucket address of x, i.e., h_0(x) =
h'(x)[96,127] >> (32 − b). We use the symbol >> to denote
the right shift of bits. The other 96 bits correspond to
y_1(x), y_2(x), ..., y_6(x), taking k = 6. This would give r = 16;
however, to save space for storing the tables used for computing
h_{i1} and h_{i2}, we hard coded the linear hash function
to make the most and the least significant bit of each chunk
of 16 bits equal to zero. Therefore, r = 15. This setup enables
us to solve problems of up to 500 billion keys, which is
plenty for all the applications we know of. If our algorithm
fails in any phase, we just restart it. As the parameters are
chosen to have success with high probability, the number of
reinitializations is O(1).
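The way the 128-bit value is divided can be illustrated with a short Python sketch. Several details are assumptions made for this example: the function name is hypothetical, the 128-bit hash is held in a single Python integer with bit 127 as the most significant bit, the order in which the 16-bit chunks are assigned to y_1, ..., y_6 is arbitrary, and the two zero bits per chunk are obtained here by masking the output, whereas the paper hard codes them into the linear hash function itself.

```python
def split_hash(h128, b):
    """Return (h_0, [y_1, ..., y_6]) from a 128-bit integer h128."""
    top32 = (h128 >> 96) & 0xFFFFFFFF     # bits 96..127
    h0 = top32 >> (32 - b)                # keep the b leading bits
    ys = []
    for j in range(6):
        chunk = (h128 >> (16 * j)) & 0xFFFF
        ys.append(chunk & 0x7FFE)         # clear most and least significant bit
    return h0, ys

h128 = (0xABCD1234 << 96) | ((1 << 96) - 1)
h0, ys = split_hash(h128, b=13)
```

With the top and bottom bit of each chunk cleared, every y value is a 15-bit string ending in 0, so each lookup table needs 2^15 entries instead of 2^16.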

Finally, the last parameter related to the hash functions we
need to discuss is the prime number p. As p must be
much larger than the range of h_{i1} and h_{i2}, we set
it to the largest 32-bit integer that is a prime, i.e., p =
4294967291.

4.3.3 Analytical results 

In this section we sketch the analysis of the hash functions
of our scheme. Note that the hash functions h_0, y_1, y_2, ...,
y_k have a range of b + kr bits in total. Thus, by universality
of linear hash functions [1], the probability that there exist
two keys that have the same values under all functions is at
most $\binom{n}{2} / 2^{b+kr}$. We will choose r such that this probability
becomes negligible. For simplicity, we assume that the zero
vector 0^L is not in the set S; it is not hard to see that this
assumption is insignificant.

A direct consequence of Theorem 5 in [1] is that, assuming
b ≤ log n − log log n, the expected size of the largest bucket is
O(n log b / 2^b), i.e., a factor O(log b) from the average bucket
size. This justifies the choice of b in Eq. (2), imposing the
requirement that ℓ ≥ log n log log n.

For any choice of s, we will now analyze the probability
(over the choice of y_1, ..., y_k) that x ↦ ρ(x, s, 0) and x ↦
ρ(x, s, 1) map the elements of B_i uniformly and independently
to {0, ..., p − 1}. A sufficient criterion for this is
that the sums $\sum_{j=1}^{k} t_j[y_j(x) \oplus \Delta]$ and $\sum_{j=k+1}^{2k} t_j[y_{j-k}(x) \oplus \Delta]$,
Δ ∈ {0, 1}, have values that are uniform in {0, ..., p − 1}
and independent. This is the case if for every x ∈ B_i there
exists an index j_x such that neither y_{j_x}(x) nor y_{j_x}(x) ⊕ 1 belongs
to y_{j_x}(B_i − {x}). Since y_1, ..., y_k are universal hash functions,
the probability that this is not the case for a given
element x ∈ B_i is bounded by (|B_i|/2^r)^k ≤ (ℓ/2^r)^k. If we
choose, for example, r = ⌈log(∛n · ℓ)⌉ and k = 4 we have that
this probability is o(1/n). Hence, the probability that this
happens for any key in S is o(1).
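The last two claims follow from a short calculation, spelled out here for convenience (assuming log denotes the base-2 logarithm and using a union bound over the n keys):

```latex
r = \lceil \log(\sqrt[3]{n}\,\ell) \rceil
\;\Longrightarrow\; 2^{r} \ge \sqrt[3]{n}\,\ell
\;\Longrightarrow\; \left(\frac{\ell}{2^{r}}\right)^{k} \le n^{-k/3},
\qquad\text{so for } k = 4:\quad
\left(\frac{\ell}{2^{r}}\right)^{4} \le n^{-4/3} = o(1/n),
\qquad
n \cdot n^{-4/3} = n^{-1/3} = o(1).
```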

Finally, we need to argue that for each bucket i it is easy to
find a value of s such that the pair h_{i1}, h_{i2} is good for the
MPHF of the bucket. We know that with constant probability
this is the case if the functions were truly random.
Now, as argued above, with probability 1 − o(1) the functions
x ↦ ρ(x, s, 0) and x ↦ ρ(x, s, 1) are random and independent
on each bucket, for every value of s. Then, for a given
bucket and a given value of s there is a probability Ω(1)
that the pair of hash functions work for that bucket. Now,
for any Δ ∈ {0, 1} and s ≠ s', the functions x ↦ ρ(x, s, Δ)
and x ↦ ρ(x, s', Δ) are independent. Thus, by Chebyshev's
inequality the probability that less than a constant fraction
of the values of s work for a given bucket is O(1/p). So with
probability 1 − o(1) there is a constant fraction of "good"
choices of s in every bucket, which means that trying an expected
constant number of random values for s is sufficient
in each bucket.

5. EXPERIMENTAL RESULTS 

In this section we present the experimental results. We start
by presenting the experimental setup. We then present the
performance of our algorithms considering construction time,
storage space and evaluation time as metrics for the resulting
MPHFs. Finally, we discuss how the amount of internal
memory available affects the runtime of our two-step external
memory based algorithm.

5.1 The data and the experimental setup 

The algorithms were implemented in the C language and
are available at http://www.dcc.ufmg.br/~fbotelho under
the GNU Lesser General Public License (LGPL). All experiments
were carried out on a computer running the Linux operating
system, version 2.6, with a 1 gigahertz AMD Athlon
64 Processor 3200+ and 1 gigabyte of main memory.

Our data consists of a collection of approximately 1 billion 
URLs collected from the Web, each URL 64 characters long 
on average. The collection is stored on disk in 60.5 gigabytes 
of space. 

5.2 Performance of the algorithms 

We are firstly interested in verifying the claim that our two-step
external memory based algorithm runs in linear time.
Therefore, we run the algorithm for several numbers n of
keys in S. The values chosen for n were 1, 2, 4, 8, 16, 32,
64, 128, 512 and 1024 million. We limited the main memory
to 512 megabytes for the experiments in order to show
that the algorithm does not need much internal memory to
generate MPHFs. The size μ of the a priori reserved internal
memory area was set to 200 megabytes. In Section 5.3
we show how μ affects the runtime of the algorithm. The
parameter b (see Eq. (2)) was set to the minimum value
that gives us a maximum bucket size lower than ℓ = 256.
For each value chosen for n, the respective values for b are
13, 14, 15, 16, 17, 18, 19, 20, 22 and 23 bits.

In order to estimate the number of trials for each value of n
we use a statistical method for determining a suitable sample
size (see, e.g., [19, Chapter 13]). We found that just one
trial for each n would be enough with a confidence level
of 95%. However, we made 10 trials. This number of trials
seems rather small, but, as shown below, the behavior of our
external algorithm is very stable and its runtime is almost
deterministic (i.e., the standard deviation is very small) because
it is a random variable that follows a (highly concentrated)
normal distribution.

Table 1 presents the runtime average for each n, the respective
standard deviations, and the respective confidence
intervals given by the average time ± the distance from average
time considering a confidence level of 95%. Observing
the runtime averages we noticed that the algorithm runs in
expected linear time, as we have claimed. Better still, it
outputs the resulting MPHF faster than all practical algorithms
we know of, for the following reasons. First,
the memory accesses during the generation of a MPHF for a
given bucket cause cache hits, since the problem was broken
down into problems of size up to 256. Second, in the searching
step we are dealing with 16-byte (128-bit) strings instead of
64-byte URLs.

Figure 6 presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model obtained
from the experimental measurements. As we were
expecting, the runtime for a given n has almost no variation.
The percentages of the total time spent in the partitioning
step and in the searching step are approximately 49% and 51%,
respectively.

An intriguing observation is that the runtime of the algorithm
is almost deterministic, in spite of the fact that it uses
as building block an algorithm with a considerable fluctuation
in its runtime. A given bucket i, 0 ≤ i < 2^b, is a small
set of keys (at most 256 keys) and the runtime of the building
block algorithm is a random variable X_i with high fluctuation
(it follows a geometric distribution with mean 1/Pr ≈
3). However, the runtime Y of the searching step of our
external algorithm is given by Y = Σ_{0≤i<2^b} X_i. Under the
hypothesis that the X_i are independent and bounded, the
law of large numbers (see, e.g., [19]) implies that the random
variable Y/2^b converges to a constant as n → ∞. This
explains why the runtime is almost deterministic.
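The concentration effect can be illustrated with a small simulation. This Python sketch is only a model under stated assumptions: it counts graph-generation attempts rather than measuring real runtime, it uses a success probability of 1/3 so that the mean is 3, it treats the buckets as independent, and the bucket count and sample sizes are arbitrary choices for the example.

```python
import random
import statistics

def geometric(p, rng):
    """Number of independent trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:
        trials += 1
    return trials

def total_iterations(num_buckets, p, rng):
    """Sum of the per-bucket attempt counts, a stand-in for Y."""
    return sum(geometric(p, rng) for _ in range(num_buckets))

rng = random.Random(2007)
P_ACYCLIC = 1 / 3
single = [geometric(P_ACYCLIC, rng) for _ in range(2000)]
totals = [total_iterations(5000, P_ACYCLIC, rng) for _ in range(30)]

# Coefficient of variation: standard deviation relative to the mean.
cv_single = statistics.stdev(single) / statistics.mean(single)
cv_total = statistics.stdev(totals) / statistics.mean(totals)
```

For a geometric variable with p = 1/3 the coefficient of variation is about 0.8, while for a sum over m independent buckets it shrinks by a factor of √m, so `cv_total` comes out far smaller than `cv_single`.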

n (millions)        1               2               4               8               16
Average time (s)    3.34 ± 0.02     6.97 ± 0.02     14.64 ± 0.04    31.75 ± 0.49    68.98 ± 0.82
SD                  0.03            0.03            0.05            0.73            1.22

n (millions)        32              64              128             512               1024
Average time (s)    142.71 ± 1.44   288.95 ± 2.65   604.70 ± 6.22   2383.08 ± 22.11   4982.97 ± 55.14
SD                  2.01            3.70            8.69            28.77             51.12

Table 1: External algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD),
and the confidence intervals considering a confidence level of 95%.

[Figure 6 plot omitted; x-axis: Number of keys (millions).]

Figure 6: Partitioning time and searching time versus
number of keys in S for our external algorithm.
The solid line corresponds to a linear regression
model for the total time.

The next important metric on MPHFs is the space required
to store the functions. In order to apply the internal algorithm
to larger sets we randomly choose f_0 and f_1 from the
family of universal hash functions proposed by Thorup [33].
The internal algorithm was analyzed under the full randomness
assumption, so universal hashing is not enough to
guarantee that it works for every key set. However, it has
worked for every key set we have applied it to. We therefore
refer to this version as the heuristic internal algorithm.

Table 2 shows how many bits per key the heuristic internal
algorithm requires to store the resulting MPHFs. In our
setup the heuristic internal algorithm requires around 2.1
and 3.3 bits per key to respectively store the resulting PHFs
and MPHFs. On a PC with 1 gigabyte of main memory the
largest set we are able to generate a MPHF for is a set with
30 million keys, because the sparse graph required to
generate the functions is memory demanding.



          Bits/key
n         PHF     MPHF
10^4      2.13    3.37
10^5      2.09    3.32
10^6      2.09    3.32
10^7      2.09    3.32

Table 2: Heuristic internal algorithm: space usage
to respectively store the resulting PHFs and
MPHFs.

The external algorithm is designed to be used when the key
set does not fit in main memory. Table 3 shows that it
can be used for constructing PHFs and MPHFs that require
approximately 2.7 and 3.8 bits per key to be stored, respectively.
The lookup tables used by the hash functions of the
external algorithm require a fixed storage cost of 1,847,424
bytes. This makes the external algorithm unsuitable for sets
with fewer than 16 million keys.



                Cost in bytes to        Bits/key
n       b       store lookup tables     PHF     MPHF
10^4    6       1,847,424               2.93    3.71
10^5    9       1,847,424               2.73    3.57
10^6    13      1,847,424               2.65    3.82
10^7    16      1,847,424               2.51    3.70
10^8    20      1,847,424               2.80    4.02
10^9    23      1,847,424               2.65    3.83

Table 3: External algorithm: space usage to respectively
store the resulting PHFs and MPHFs.
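To put the fixed cost in perspective, it can be amortized over the keys. The short Python sketch below does only that arithmetic. The function name is hypothetical, and reading the 16 million threshold as the point where the amortized cost drops below about one bit per key is an interpretation, not something the paper states.

```python
LOOKUP_TABLE_BYTES = 1_847_424   # fixed cost reported in Table 3

def fixed_cost_bits_per_key(n):
    """Lookup-table storage spread evenly over n keys, in bits per key."""
    return LOOKUP_TABLE_BYTES * 8 / n

for n in (10**4, 10**6, 16 * 10**6, 10**9):
    print(n, round(fixed_cost_bits_per_key(n), 3))
```

For 16 million keys the amortized cost is roughly 0.92 bits per key, and for 10^9 keys it is below 0.02 bits per key, so the tables matter only for small key sets.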

To overcome the problem mentioned above we have implemented
a version of the external algorithm that uses the
pseudo random hash function proposed by Jenkins [20]. This
function was used instead of the linear hash function described
in Section 4.3.2 and instead of the two truly random
hash functions of each bucket, i.e., h_{i1} and h_{i2}, where
0 ≤ i < 2^b. This version is, from now on, referred to as the
heuristic external algorithm. The Jenkins function just loops
around the key doing bitwise operations over chunks of 12
bytes. Then, it returns the last chunk. Thus, in the mapping
step, the key set S is mapped to F, which now contains
12-byte long strings instead of 16-byte long strings.

The Jenkins function needs just one random seed of 32 bits
to be stored instead of quite long lookup tables. Therefore,
there is no fixed cost to store the resulting MPHFs, but two
random seeds of 32 bits are required to describe the functions
h_{i1} and h_{i2} of each bucket. As a consequence, both the
generation of the MPHFs and their evaluation at retrieval
time are faster (see Tables 4 and 5). The reasons are twofold.
First, we are dealing with 12-byte strings instead of 16-byte
strings. Second, there are no large lookup tables to cause
cache misses. For example, the construction time for a set
of 1024 million keys has dropped to 1.04 hours in the
same setup. The disadvantage of using the Jenkins function
is that there is no formal proof that it works for every key
set. That is why the hash functions we have designed in this
paper are required, even though they are slower. In the
available implementation, the hash functions to be used can be chosen
by the user.

Table 4 presents a comparison of our methods with the
ones proposed by Pagh [28] (Hash-displace), by Botelho, Kohayakawa
and Ziviani [5] (BKZ), by Czech, Havas and Majewski
[8] (CHM), and by Fox, Chen and Heath [13] (FCH),
considering construction time and storage space as metrics.
The form of the MPHFs generated by those methods is presented
in Section 3. Notice that they are the most important
practical results on MPHFs known in the literature. Observing
the results, our heuristic internal algorithm is the best
choice for sets that can be handled in main memory and our
external algorithm is the first one that can be applied to sets
that do not fit in main memory.



Time in seconds to construct a MPHF for 2 × 10^6 keys

Algorithms                      Function type   Construction time (seconds)   bits/key
Heuristic Internal Algorithm    PHF             12.99 ± 1.01                  2.09
                                MPHF            13.94 ± 1.06                  3.35
External Algorithm              PHF             6.92 ± 0.04                   2.64
                                MPHF            6.98 ± 0.01                   3.85
Heuristic External Algorithm    MPHF            4.75 ± 0.02                   3.7
Hash-displace                   MPHF            46.18 ± 1.06                  64.00
BKZ                             MPHF            8.57 ± 0.31                   32.00
CHM                             MPHF            13.80 ± 0.70                  66.88
FCH                             MPHF            758.66 ± 126.72               5.84

Table 4: Construction time and storage space without
considering the fixed cost to store lookup tables.

Finally, we show how efficient the resulting MPHFs are at
retrieval time for the aforementioned methods, which is as
important as construction time and storage space. Table 5
presents the time, in seconds, to evaluate 2 × 10^6 keys.
We group the BKZ and CHM methods together because
the resulting MPHFs have the same form. From the results
we can conclude that our heuristic internal algorithm
generates MPHFs that are as fast to compute as the
ones generated by the most practical methods on MPHFs.
The MPHFs generated by the external algorithm are slower.
Nevertheless, the difference is not so significant (each key
can be evaluated in a few microseconds) and the external algorithm
is the first efficient option for sets that do not fit in
main memory.



Time in seconds to evaluate 2 × 10^6 keys

                                                key length (bytes)
Algorithms                      Function type   8       16      32      64      128
Heuristic Internal Algorithm    PHF             0.41    0.55    0.79    1.29    2.39
                                MPHF            0.85    0.99    1.23    1.73    2.74
External Algorithm              PHF             2.05    2.31    2.84    3.99    7.22
                                MPHF            2.55    2.83    3.38    4.63    8.18
Heuristic External Algorithm    MPHF            1.19    1.35    1.59    2.11    3.34
Hash-displace                   MPHF            0.56    0.69    0.93    1.44    2.54
BKZ/CHM                         MPHF            0.61    0.74    0.98    1.48    2.58
FCH                             MPHF            0.58    0.72    0.96    1.46    2.56

Table 5: Evaluation time.

It is important to emphasize that the BKZ, CHM and FCH
methods were analyzed under the full randomness assumption,
as was our heuristic internal algorithm. Therefore,
our external algorithm is the first one that has experimentally
proven practicality for large key sets and has both the space
usage for representing the resulting functions and the construction
time carefully proven. Additionally, it is the fastest
algorithm for constructing the functions, and the resulting
functions are much simpler than the ones generated by previous
theoretically well-founded schemes, so that they can be
used in practice. Also, it considerably improves on the first step
given by Pagh with his hash and displace method [28].



5.3 Controlling disk accesses 

In order to bring down the number of seek operations on
disk we benefit from the fact that our external algorithm
leaves almost all main memory available to be used as disk
I/O buffer. In this section we evaluate how much the parameter
μ affects the runtime of our external algorithm. For
that we fixed n at approximately 1 billion URLs, set the
main memory of the machine used for the experiments to
1 gigabyte and used μ equal to 100, 200, 300, 400 and 500
megabytes.

In order to amortize the number of seeks performed we use
a buffering technique [21]. We create a buffer j of size B =
μ/N for each file j, where 1 ≤ j ≤ N. Every time a read
operation is requested to file j and the data is not found
in the jth buffer, B bytes are read from file j to buffer j.
Hence, the number of seeks performed in the worst case
is given by β/B (remember that β is the size in bytes of
the fixed-length key set F). For that we have made the
pessimistic assumption that one seek happens every time
buffer j is filled in. Thus, the number of seeks performed
in the worst case is 16n/B, since after the partitioning step
we are dealing with 128-bit (16-byte) strings instead of 64-byte
URLs, on average. Therefore, the number of seeks is
linear in n and amortized by B. It is important to emphasize
that the operating system uses techniques to diminish the
number of seeks and the average seek time. This makes the
amortization factor greater than B in practice.
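The buffer-size and worst-case seek formulas can be evaluated with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, 1 MB is taken as 1024 KB, and results are rounded up. Under those assumptions B = μ/N reproduces the B row of Table 6. The β/B row of Table 6 is reproduced when β is taken as the 60.5-gigabyte collection size from Section 5.1 rather than 16n, so β is left as a parameter and either reading can be tried.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)

def buffer_size_kb(mu_mb, num_files):
    """B = mu / N in kilobytes (1 MB = 1024 KB), rounded up."""
    return ceil_div(mu_mb * 1024, num_files)

def worst_case_seeks(beta_kb, buffer_kb):
    """One seek per buffer refill: beta / B, rounded up."""
    return ceil_div(beta_kb, buffer_kb)

FILES = {100: 245, 200: 99, 300: 63, 400: 46, 500: 36}   # mu (MB) -> N, Table 6
buffers = {mu: buffer_size_kb(mu, n_files) for mu, n_files in FILES.items()}

BETA_COLLECTION_KB = 63_438_848            # 60.5 GB expressed in KB
seeks_collection = {mu: worst_case_seeks(BETA_COLLECTION_KB, b)
                    for mu, b in buffers.items()}

BETA_16N_KB = ceil_div(16 * 10**9, 1024)   # beta = 16n bytes with n = 10^9
seeks_16n = {mu: worst_case_seeks(BETA_16N_KB, b) for mu, b in buffers.items()}
```

In both readings the bound falls roughly in proportion to 1/B, which is the amortization effect described above.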

Table 6 presents the number of files N, the buffer size used
for all files, the number of seeks in the worst case considering
the pessimistic assumption aforementioned, and the time to
generate a (minimal) PHF for approximately 1 billion keys
as a function of the amount of internal memory available.
Observing Table 6 we noticed that the time spent in the
construction decreases as the value of μ increases. However,
for μ > 400, the variation in the time is not as significant
as for μ ≤ 400. This can be explained by the fact that
the I/O scheduler of the Linux 2.6 kernel has smart policies for
avoiding seeks and diminishing the average seek time (see
http://www.linuxjournal.com/article/6931).



μ (MB)          100        200       300       400      500
N (files)       245        99        63        46       36
B (in KB)       418        2,069     4,877     8,905    14,223
β/B             151,768    30,662    13,008    7,124    4,461
Time (hours)    1.58       1.37      1.33      1.32     1.32

Table 6: Influence of the internal memory area size
(μ) in our external algorithm runtime.



6. CONCLUDING REMARKS 

This paper has presented two novel algorithms for constructing
PHFs and MPHFs and three implementations of the algorithms.
The implementations in the C language are available
at http://anonymous under the GNU Lesser General
Public License (LGPL).

The first algorithm, referred to as internal algorithm, assumes
that two truly random hash functions f_0 and f_1 are
available for free so that a PHF or a MPHF can be constructed
from the acyclic random graph induced by f_0 and
f_1. The full randomness assumption is not realistic because
each truly random hash function would require at least
n log n bits to be stored, which is memory demanding.

In order to compare the internal algorithm with the most important
practical results on MPHFs that consider the same
assumption (see Section 3) we have chosen the required hash
functions from the family of universal hash functions proposed
by Thorup [33]. As universal hash functions are not
enough to guarantee that the algorithm would work for every
key set, we have referred to this version of the
algorithm as the heuristic internal algorithm.

The heuristic internal algorithm outperforms all previous
methods considering the storage space required for the resulting
functions. The resulting PHFs and MPHFs require
approximately 2.1 and 3.3 bits per key to be stored, respectively.
Better still, the resulting functions are almost as fast
to compute and generate as the ones coming from previous
methods known in the literature. Tables 2, 4 and 5
summarize the experimental results.

The second algorithm, referred to as external algorithm,
contains, as a component, a provably good implementation
of the internal memory algorithm. This means that
the two hash functions h_{i1} and h_{i2} (see Eq. (3)) used instead
of f_0 and f_1 behave as truly random hash functions
(see Section 4.3.3). The resulting PHFs and MPHFs require
approximately 2.7 and 3.8 bits per key to be stored
and are generated faster than the ones generated by all previous
methods (including our heuristic internal algorithm).
The external algorithm is the first one that has experimentally
proven practicality for sets in the order of billions of
keys and has time and space usage carefully analyzed without
unrealistic assumptions. As a consequence, the external
algorithm will work for every key set.

The resulting functions of the external algorithm are approximately
four times slower than the ones generated by
our heuristic internal algorithm and by all previous practical
methods (see Table 5). The reason is that to compute
the involved hash functions we need to access lookup tables
that do not fit in the cache. To overcome this problem, at
the expense of losing the guarantee that it works for every
key set, we have proposed a heuristic version of the external
algorithm that uses a very efficient pseudo random hash
function proposed by Jenkins [20]. The resulting functions
require the same storage space, are now less than two times
slower to compute and are still faster to generate.

Besides the data management applications for minimal perfect
hash functions described in Section 1, the external algorithm
will be very useful for the information retrieval community
as well. Search engines are nowadays indexing tens
of billions of pages and working with huge collections is
becoming a daily task. For instance, the simple assignment
of number identifiers to web pages of a collection can be a
challenging task. While traditional databases simply cannot
handle more traffic once the working set of URLs does not
fit in main memory anymore [32], the external algorithm we
propose here to construct MPHFs can easily scale to billions
of entries. Also, algorithms like PageRank [6], which
use the web link structure to derive a measure of popularity
for Web pages, operate on the web graph. At construction
time of the graph, the URLs must be mapped to integers
that will be used to label the vertices. For the same reason,
the WebGraph research group would also benefit from a
MPHF for sets in the order of billions of URLs to scale and
to improve the storage requirements of their algorithms on
graph compression.